Bioinformatics A Practical Guide to Next Generation Sequencing Data Analysis (Hamid D. Ismail)

studied. To overcome this challenge, aligners must adopt a strategy that performs gapped

alignment rather than only exact alignment.
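
To make the difference between exact and gapped alignment concrete, the following is a minimal sketch, not the algorithm used by any particular aligner. It assumes a toy reference and a read that lacks one reference base. The function names, the sequences, and the unit cost per mismatch, insertion, or deletion are all illustrative assumptions. The exact search finds no location for the read, while the gapped search places it with a single edit.

```python
def exact_positions(reference, read):
    """Return every 0-based start where the read matches the reference exactly."""
    return [i for i in range(len(reference) - len(read) + 1)
            if reference.startswith(read, i)]


def best_gapped_hit(reference, read):
    """Return (edit_distance, end) for the best approximate match of the read.

    Semi-global dynamic programming: the whole read must be aligned, but it may
    start and end anywhere in the reference. Mismatches, insertions, and
    deletions each cost 1 (an assumed, simplified scoring scheme).
    """
    # Row 0: an empty read prefix matches at any reference position at cost 0.
    prev = [0] * (len(reference) + 1)
    for i, read_base in enumerate(read, start=1):
        # Column 0: i read bases against an empty reference cost i insertions.
        curr = [i] + [0] * len(reference)
        for j, ref_base in enumerate(reference, start=1):
            curr[j] = min(
                prev[j - 1] + (read_base != ref_base),  # match or mismatch
                prev[j] + 1,      # read base with no reference partner
                curr[j - 1] + 1,  # reference base skipped by the read (gap)
            )
        prev = curr
    # The smallest value in the last row marks where the best alignment ends.
    end = min(range(len(prev)), key=prev.__getitem__)
    return prev[end], end


reference = "ACGTTAGCCATGGA"
read = "GTTACCATG"  # reference[2:12] with the 'G' at position 6 deleted

print(exact_positions(reference, read))
print(best_gapped_hit(reference, read))
```

Real aligners avoid running such a full dynamic programming pass over the whole genome; they typically use the index of the reference to find candidate locations first and apply gapped alignment only there.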

Read mapping is required by most sequencing applications, including reference-based

genome assembly, variant discovery, gene expression, epigenetics, and metagenomics. As

shown in Figure 2.13, the workflow of read mapping/alignment includes downloading the

right FASTA file of the reference genome of the species studied, indexing the sequence of

the reference genome with the “samtools faidx” command, indexing the reference genome

with an aligner, and finally performing mapping of the cleaned reads with the aligner

itself. Remember that, before mapping, the step of quality control must be performed as

discussed in Chapter 1. So, when we move to the step of read mapping, we should have

already cleaned up the reads and fixed most of the failed metrics shown by the QC reports.

Even if we could not fix all errors, we should be aware of their effect on the final

results.
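
The indexing and mapping steps of this workflow can be sketched as a short script. This is only an illustration under stated assumptions: BWA is assumed as the aligner, the file names (reference.fa, clean_R1.fastq, clean_R2.fastq, sample.sam) are hypothetical, and the reference FASTA file is assumed to have been downloaded already. The reference is indexed with "samtools faidx", the samtools subcommand that indexes FASTA files. By default, the script only prints the commands, so it can be inspected without the tools being installed.

```python
import shlex
import subprocess


def mapping_steps(reference_fasta, fastq_files, sam_out):
    """Return (argv, stdout_path) pairs for indexing the reference and mapping reads."""
    return [
        # FASTA index (.fai) that allows random access to the reference sequences
        (["samtools", "faidx", reference_fasta], None),
        # Aligner-specific index; BWA is an assumed choice of aligner
        (["bwa", "index", reference_fasta], None),
        # BWA-MEM writes SAM records to standard output, captured in sam_out
        (["bwa", "mem", reference_fasta, *fastq_files], sam_out),
    ]


def run_steps(steps, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv, stdout_path in steps:
        line = shlex.join(argv)
        if stdout_path:
            line += " > " + shlex.quote(stdout_path)
        print(line)
        if dry_run:
            continue
        if stdout_path:
            with open(stdout_path, "w") as out:
                subprocess.run(argv, stdout=out, check=True)
        else:
            subprocess.run(argv, check=True)


# Hypothetical file names, used only for illustration
steps = mapping_steps("reference.fa",
                      ["clean_R1.fastq", "clean_R2.fastq"],
                      "sample.sam")
run_steps(steps)  # dry run: prints the commands without executing them
```

Calling run_steps(steps, dry_run=False) would execute the commands, provided that samtools and BWA are installed and the input files exist.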

We have also discussed above how to download a reference genome of an organism and

how to index it. To avoid repetition, we will delve into read mapping and generation of

SAM/BAM files without covering the topics that we have already discussed.

Read mapping is the process of finding locations on a reference genome where reads,

contained in FASTQ files, map. The read mapping information is then stored in a

SAM/BAM file, which is a special file format for storing sequence alignment

FIGURE 2.13 The general workflow of the read mapping.